Common Vulnerabilities and Exposures (CVE) is a list of publicly known cybersecurity vulnerabilities. Each entry contains an identification number, a description, and at least one public reference. The list is maintained by the MITRE Corporation, and its entries are also published in the National Vulnerability Database (NVD), which is maintained by NIST.
When a new CVE is published on the NVD, it typically contains a paragraph of text, the 'description', that describes the vulnerability. For example, the description for CVE-2018-17189 reads:
In Apache HTTP server versions 2.4.37 and prior, by sending request bodies in a slow loris way to plain resources, the h2 stream for that request unnecessarily occupied a server thread cleaning up that incoming data. This affects only HTTP/2 (mod_http2) connections.
NVD typically takes 3-5 business days to fill in the 'vendor' column. In this case the vendor would be apache.
This exercise tests whether the vendor can be derived by fine-tuning the GPT2 model on the description text. If so, new CVEs could be classified automatically without waiting for NVD to supply the details.
Main idea: GPT2 is a decoder-only transformer with causal attention, so the hidden state at the last token of the input sequence attends to every preceding token, and it is what the model uses to predict the next token. That last-token representation therefore summarizes the whole input. Instead of using it for generation, we can feed it to a classification head that predicts the vendor.
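The last-token idea can be illustrated with a minimal, dependency-free sketch. It is not the training code for this exercise: the hidden states, weights, and vendor list below are made-up numbers, and the helper names (last_token_index, classify_from_last_token) are hypothetical. It only shows how the hidden state of the last non-padding token is selected and passed through a linear classification head, assuming right-padded inputs.

```python
from typing import List, Sequence


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Index of the last real (non-padding) token, assuming right padding."""
    if 1 not in attention_mask:
        raise ValueError("sequence contains no real tokens")
    return sum(attention_mask) - 1


def classify_from_last_token(
    hidden_states: List[List[float]],   # one vector per token position
    attention_mask: Sequence[int],      # 1 = real token, 0 = padding
    weights: List[List[float]],         # one row of weights per class
    bias: List[float],                  # one bias per class
) -> int:
    """Apply a linear head to the last real token's hidden state; return the argmax class."""
    pooled = hidden_states[last_token_index(attention_mask)]
    logits = [
        sum(w * h for w, h in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy example with invented values: 3 token positions (the last is padding),
# hidden size 2, and 3 hypothetical vendor classes.
VENDORS = ["apache", "microsoft", "oracle"]
hidden_states = [[0.1, 0.2], [0.9, -0.3], [0.0, 0.0]]
attention_mask = [1, 1, 0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]

predicted = VENDORS[classify_from_last_token(hidden_states, attention_mask, weights, bias)]
print(predicted)
```

In practice the hidden states would come from the fine-tuned GPT2 model and the head would be learned during training. The Hugging Face Transformers class GPT2ForSequenceClassification pools in the same way, classifying from the last non-padding token when a padding token is configured.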
Previously, an LSTM model was used for this classification task; it reached a validation accuracy of 93% on 20 vendors. Using GPT2, the goal is to increase the number of vendors (classes) while maintaining high accuracy.